Status of the CP-PACS Project *

Y. Iwasaki^a for the CP-PACS Collaboration

^a Center for Computational Physics and Institute of Physics, University of Tsukuba, Ibaraki 305, Japan

The CP-PACS computer, with a peak speed of 300 Gflops, was completed in March 1996 and has started
operation. We describe the final specification and hardware implementation of the CP-PACS computer and its
performance for QCD codes. A plan for the upgrade of the computer scheduled for the fall of 1996 is also given.



1. CP-PACS project 

The CP-PACS project [1] is a five-year project
which formally started in 1992. The project currently
consists of 33 members in physics and computer
science as listed in Ref. We selected
Hitachi Ltd. as the industrial partner through a
formal bidding process soon after the start of the
project, and we have been working in close collaboration
on the hardware and software development
of the CP-PACS computer. The fundamental
design of the computer was laid down in
1992, its details were worked out in 1993, and the
logical design and the physical packaging design were
completed in 1994. Chip fabrication and assembly
of parts started in early 1995, and the CP-PACS
computer with a peak speed of 300 Gflops
was completed in March 1996.

2. Hardware implementation 

A picture of the CP-PACS computer is shown in
Fig. 1. The size of the computer is roughly
2 m x 4 m x 3 m in height, width and depth. The
floor plan is depicted in Fig. 2. For the major
architectural characteristics of the computer I refer
to Ref. [1].

The final specification of the CP-PACS computer
is summarized in Table 1. The size of the
second-level cache has been doubled since last
year. The latency figure for data transfer is the
value measured in the remote DMA mode, which
is the fastest mode for data transfer; it includes
software and hardware overheads and is averaged
over transfers through the x, y and z crossbar switches.

Table 1
Specification of the CP-PACS computer

peak speed              300 Gflops (64 bit data)
main memory             64 GB
parallel architecture   MIMD with distributed memory
number of nodes         1024
node processor          HP PA-RISC1.1 + PVP-SW
  #FP registers         128
  clock cycle           150 MHz
  1st level cache       16 KB(I) + 16 KB(D)
  2nd level cache       512 KB(I) + 512 KB(D)
network                 3-d hyper-crossbar
  node array            8 x 17 x 8*
  through-put           300 MB/sec
  latency               3 µsec
distributed disks       3.5" RAID-5 disk
  total capacity        529 GB
software
  OS                    UNIX micro kernel
  language              FORTRAN, C, assembler
front end               main frame connected by HIPPI

* including nodes for I/O

* Presented at Lattice 96, St. Louis, USA.
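As a quick consistency check on the figures in Table 1, the following Python sketch recomputes the total size of the 8 x 17 x 8 node array and the per-node share of peak speed and memory. Treating every node beyond the 1024 processing nodes as an I/O node, and converting GB to MB with a factor of 1024, are assumptions made here for illustration; the variable names are not from the paper.

```python
# Figures taken from Table 1.
PEAK_GFLOPS = 300.0
MAIN_MEMORY_GB = 64.0
PROCESSING_NODES = 1024
NODE_ARRAY = (8, 17, 8)  # node array, including nodes for I/O


def array_size(dims):
    """Product of the array dimensions."""
    total = 1
    for d in dims:
        total *= d
    return total


total_nodes = array_size(NODE_ARRAY)
# Assumption: every node beyond the 1024 processing nodes is an I/O node.
io_nodes = total_nodes - PROCESSING_NODES
peak_per_node_mflops = PEAK_GFLOPS * 1000.0 / PROCESSING_NODES
# Assumption: 1 GB = 1024 MB.
memory_per_node_mb = MAIN_MEMORY_GB * 1024.0 / PROCESSING_NODES

print(total_nodes, io_nodes)
print(round(peak_per_node_mflops, 1), memory_per_node_mb)
```

Under these assumptions the array holds 1088 nodes, 64 of which would be I/O nodes, and each processing node accounts for roughly 293 Mflops of the quoted peak and 64 MB of memory.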

Figure 3 shows the floor plan of the CPU
chip, which is fabricated in 0.3 micron CMOS
semiconductor technology and measures
1.57 cm x 1.57 cm. The PVP-SW feature, which
enables efficient vector calculations within
the RISC architecture of the CPU, is implemented
with 128 floating-point registers in the green part
at the lower right corner of Fig. 3.

A silicon multichip module is depicted in Fig. 4,
where the chip located at the center is the CPU
chip. The two adjacent chips are the network
interface adapter (NIA) and the storage controller
(SC), which are fabricated with a 0.5 micron
gate-array technique. The twelve chips surrounding them
are the off-chip second-level cache made of SRAM.
The size of the module is about 5.7 cm x 7.2 cm.

Figure 1. Outlook of the CP-PACS computer

Figure 2. Floor plan of the CP-PACS computer

Figure 3. Floor plan of the CPU chip

One board, which consists of 8 nodes, is shown
in Fig. 5. The center of the white part of each
unit corresponds to the multichip module shown
in Fig. 4, now with fins for air-cooling. The
black part is the main memory with 4 Mbit
DRAM. The other three white chips on each unit
are main-memory address/data controllers. In
addition, each board has two chips for crossbar
switches in the x direction and one chip for the clock
distributor. The size of one board is 45.6 cm x
62.5 cm. Sixteen boards are installed in a crate
and two crates are installed in a cabinet, represented
by a square with symbol P in Fig. 2.
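The packaging hierarchy just described can be checked with a few lines of arithmetic. The Python sketch below multiplies out the board, crate and cabinet counts and asks how many cabinets the 1024 processing nodes would occupy. Counting only the processing nodes (and not the I/O nodes) is an assumption, as are the variable names.

```python
# Packaging figures quoted in the text.
NODES_PER_BOARD = 8
BOARDS_PER_CRATE = 16
CRATES_PER_CABINET = 2
PROCESSING_NODES = 1024  # from the specification table

nodes_per_crate = NODES_PER_BOARD * BOARDS_PER_CRATE
nodes_per_cabinet = nodes_per_crate * CRATES_PER_CABINET

# Ceiling division: cabinets needed to hold all processing nodes.
cabinets_needed = -(-PROCESSING_NODES // nodes_per_cabinet)

print(nodes_per_crate, nodes_per_cabinet, cabinets_needed)
```

With 256 nodes per cabinet, the processing nodes fill four cabinets, which agrees with the four cabinets of type P mentioned in the text.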

The crossbar switches in the x direction are
mounted on each board, connecting 8 nodes as
explained above; those in the y direction are placed
on a back-plane located in the four cabinets (symbol
P in Fig. 2), and those in the z direction are mounted
on a board which is housed in one cabinet (symbol
Z in Fig. 2).
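To illustrate why a three-dimensional hyper-crossbar keeps paths short, the Python sketch below models each node by its (x, y, z) coordinates and assumes that one crossbar switch links all nodes that differ only in a single coordinate. Under that idealized model a message needs at most one crossbar traversal per differing coordinate, so at most three in total. The fixed x, then y, then z order and the function names are assumptions for illustration and are not taken from the actual CP-PACS routing hardware.

```python
def hops(src, dst):
    """Number of crossbar traversals: one per coordinate that differs."""
    return sum(1 for s, d in zip(src, dst) if s != d)


def route(src, dst):
    """Nodes visited after each traversal, correcting x, then y, then z.

    The dimension order is an assumption of this sketch.
    """
    path = []
    current = list(src)
    for axis in range(3):
        if current[axis] != dst[axis]:
            current[axis] = dst[axis]  # one crossbar spans the whole axis
            path.append(tuple(current))
    return path


print(hops((0, 0, 0), (7, 16, 7)))
print(route((0, 0, 0), (3, 5, 0)))
```

In this model two nodes on the same board (same y and z) are one traversal apart, while nodes at opposite corners of the 8 x 17 x 8 array are three traversals apart.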

In the two cabinets with symbol IOU, adaptors
for I/O of data to the distributed disks are
installed. RAID-5 disks, which are connected by
a SCSI-2 bus through the adaptors, are set in
cabinets installed a few meters apart.

3. Performance 

We write codes for lattice QCD in Fortran
90, with libraries for data communication.
A Fortran compiler incorporating the PVP-SW
feature has been newly developed, and it
produces efficient object code. The performance of
the object code is typically 90 - 150 Mflops per
node, depending on the structure of the do-loop.
The through-put of data transfer between
nodes with the Fortran libraries, taking a transfer
of 576 Kbytes as an example, is 250 Mbytes/sec,
which is to be compared with the peak through-put
of 300 Mbytes/sec.
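For orientation, the measured through-put can be expressed as a fraction of the peak and as a time per transfer. The short Python sketch below does this for the 576 Kbyte example. Taking 1 Mbyte = 1000 Kbytes is an assumption, since the text does not say which convention is used.

```python
PEAK_MB_PER_SEC = 300.0
MEASURED_MB_PER_SEC = 250.0
MESSAGE_KBYTES = 576.0

# Fraction of the peak through-put reached with the Fortran libraries.
efficiency = MEASURED_MB_PER_SEC / PEAK_MB_PER_SEC

# Assumption: 1 Mbyte = 1000 Kbytes.
message_mbytes = MESSAGE_KBYTES / 1000.0
transfer_time_ms = message_mbytes / MEASURED_MB_PER_SEC * 1000.0

print(round(efficiency, 3), round(transfer_time_ms, 3))
```

Under this convention the library reaches about 83 % of the peak through-put, and one 576 Kbyte transfer takes roughly 2.3 msec.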



Figure 4. Silicon multichip module of CPU 

The update time per link with a pseudo-heat
bath method for one processor is 55.3 µsec, which
corresponds to 103 Mflops/PU. The performance
for the Wilson quark matrix multiplication for
the red-black algorithm is 96 Mflops/PU with
the present code. On the other hand, we have
a hand-optimized assembler code for the Wilson
quark matrix multiplication, with which the
performance reaches 195 Mflops/PU, which is 65 %
of the peak speed. We are now modifying this
assembler code for the red-black algorithm. For
the MR red/black solver, the performance of the
calculation part is 122 Mflops/PU, which reduces
to 93 Mflops/PU when the data communication
is included. The percentage of communication is
about 20 % of the total time. In the case of the
CG solver for KS fermions, the performance is 128
Mflops when the length of the innermost
loop is 128.
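The relations between the numbers quoted above can be made explicit with a little arithmetic. The Python sketch below derives the implied floating-point operation count per link update, the per-node peak implied by "195 Mflops is 65 % of peak", and the communication share implied by the drop from 122 to 93 Mflops/PU. It assumes a fixed operation count and no overlap between computation and communication; the function names are illustrative only.

```python
def flops_per_link_update(update_time_usec, mflops_per_pu):
    """usec * Mflops = (1e-6 s) * (1e6 flop/s) = flop."""
    return update_time_usec * mflops_per_pu


def implied_peak_mflops(mflops, fraction_of_peak):
    """Peak speed implied by a measured speed and its quoted fraction."""
    return mflops / fraction_of_peak


def communication_fraction(mflops_compute_only, mflops_with_comm):
    """Share of total time spent communicating.

    Assumption: the flop count is fixed and communication does not
    overlap with computation, so time is proportional to 1 / Mflops.
    """
    return 1.0 - mflops_with_comm / mflops_compute_only


print(round(flops_per_link_update(55.3, 103.0)))
print(round(implied_peak_mflops(195.0, 0.65)))
print(round(communication_fraction(122.0, 93.0), 3))
```

Under these assumptions a link update costs about 5700 floating-point operations, the per-node peak is 300 Mflops, and communication takes about 24 % of the solver time, in rough agreement with the quoted figure of about 20 %.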

After checking the fundamental performance,
we tested the computer as a whole with a
quenched QCD spectrum calculation
using the Wilson quark action on a 64^ lattice
at β = 6.0 for three hopping parameters
(m_π/m_ρ ≈ 0.7, 0.5, and 0.4), for two of which
previous mass spectrum calculations already
exist. Results for the effective masses of
hadrons for the smaller two hopping parameters
are in good agreement with the previous results.
This makes us confident that the machine is working
properly and that our codes are correct.



Figure 5. One board consists of eight CPU units 

4. Upgrade of the CP-PACS computer

We plan to upgrade the CP-PACS to a peak
speed of 600 Gflops and a memory size of 128
Gbytes, increasing the number of nodes from 1024
to 2048 in the coming fall. The total funding
approved, including that for the upgrade, is 2.2 billion
yen (about 22 million US dollars). Until the
upgrade we plan to run a quenched spectroscopy
calculation with Wilson quarks at four values of
β in the range of m_π/m_ρ = 0.4 to 0.75 on lattices
with a spatial size of 3.0 fm.

This work is supported by the Grant-in-Aid
of the Ministry of Education, Science and Culture
(No. 07NP0401).